For our project, we will predict the outcomes of Connect 4 games using the state of the grid as input variables. Each square on the 6 x 7 grid has been coded into one of 3 possible states: ‘x’ or ‘o’ representing the two colors of the checker pieces chosen by the opposing players, and ‘b’ representing a blank square. This results in 6 x 7 x 3 = 126 input features, which are then fed into a sequential neural network with a softmax output layer to predict whether games result in a win, loss or draw with respect to the first player (i.e. the player who chose ‘x’).
Connect 4 gameplay
Connect 4 is a popular 2-player game in which players take turns dropping checker pieces of their chosen color into a 6 x 7 suspended grid. The game is won by the first player who successfully forms a horizontal, vertical or diagonal line on the playing grid with 4 of their pieces.
There are 4,531,985,219,092 possible configurations for the position of pieces on a standard 6 x 7 grid. In game-theoretic terms, Connect 4 is classified as an adversarial, zero-sum game, which means that any advantage to one player is always to the disadvantage of the other. While it is possible in principle to solve the game through brute-force computation, the game has also been solved by game theorists, and it can be shown that with perfect play the first player can force a win before the 42nd move. However, our project will not employ any game-theoretic adversarial models; instead, we use deep learning alone to teach our neural network to recognize winning positions in the game by observing a large number of labeled games.
While computers have been used to play games for many decades, deep learning is a relatively new development in the endeavor. The simplest games, such as tic-tac-toe, may be completely solved using a simple tree search, such that at any given state of the game the computer program may simply traverse every possible move until it reaches a winning outcome. Tree search methods were also used by Deep Blue (in conjunction with evaluation functions designed by chess experts) when it defeated reigning world chess champion Garry Kasparov in their 1997 rematch.
Illustration of tree search
However, more complicated games such as Go render tree search methods infeasible given the current state of computing, since a game of Go has more possible configurations than there are atoms in the known universe. Instead of traversing the game tree by brute force, Google DeepMind used (i) Monte Carlo tree search methods in conjunction with (ii) deep learning to develop the AlphaGo program, which defeated 9-dan Go champion Lee Sedol in 2016.
Thus, inspired by AlphaGo, we have similarly decided to use deep learning to predict the most likely outcomes of Connect 4 games.
Our choice of deep learning architecture is the sequential neural network, which is a basic linear stack of layers of neurons. Our sequential neural network has the following configuration: an input layer which consists of 126 neurons (1 for each input feature of our data), 3 hidden layers with (600, 300, 150) neurons respectively, and an output layer consisting of 3 neurons (1 for each possible outcome).
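The forward pass through this (126, 600, 300, 150, 3) stack can be sketched in plain NumPy. The random weight initialization and input vector here are illustrative assumptions; only the layer sizes and activations come from the description above:

```python
import numpy as np

# Minimal forward pass through the stated architecture
# (126 -> 600 -> 300 -> 150 -> 3), with ReLU in the hidden layers and
# softmax at the output. Weight initialization is illustrative only.
rng = np.random.default_rng(0)
sizes = [126, 600, 300, 150, 3]
weights = [rng.normal(0, 0.05, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def relu(x):
    return np.maximum(0, x)

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def predict(x):
    for w, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ w + b)
    return softmax(x @ weights[-1] + biases[-1])

probs = predict(rng.random(126))     # probabilities over win / loss / draw
```

In practice we would build this with a deep learning framework rather than raw matrix products, but the computation per layer is exactly this affine map followed by a nonlinearity.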
Illustration of simple neural network with 1 hidden layer of 4 neurons
We decided to use the rectified linear unit (ReLU), i.e. \(f(x) = \max(0, x)\), as the activation function in our hidden layers for 2 primary reasons. First, the ReLU has been shown to be computationally fast compared to other activation functions. Second, the ReLU does not suffer from the so-called ‘vanishing gradient problem’ experienced by activation functions such as the sigmoid or hyperbolic tangent, in which the weights of a neuron become nearly impossible to update as the gradient of the function approaches 0. This is because the gradient of the ReLU is a constant 1 over the entire domain of positive real values, so it never shrinks no matter how large the input becomes.
ReLU activation function
tanh activation function
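The contrast between the two activation functions can be checked numerically. A small sketch (the sample inputs are arbitrary):

```python
import numpy as np

# ReLU and its gradient: the gradient is exactly 1 for every positive
# input, while the tanh gradient collapses toward 0 for large inputs.
def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 3.0, 100.0])
tanh_grad = 1.0 - np.tanh(x) ** 2    # vanishes as |x| grows
```

At \(x = 100\) the tanh gradient is numerically 0, so a tanh neuron saturated at that input would effectively stop learning, whereas the ReLU gradient there is still 1.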
We have chosen the softmax function, i.e. \(\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}\), for our output layer as it is the natural choice for predicting likelihoods over K > 2 classes. The softmax function represents each output probability as the exponentiated input over the sum of all exponentiated inputs, which guarantees that the outputs are non-negative and sum to 1. The final prediction is then \(\underset{j}{\operatorname{argmax}} \ \sigma(z)_j\).
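A short numerical sketch of the softmax output and the argmax prediction; the logit values are made up for illustration:

```python
import numpy as np

# Softmax over 3 output logits (win, loss, draw) and the final argmax
# prediction. The logit values are illustrative.
def softmax(z):
    e = np.exp(z - z.max())          # shift by max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
prediction = int(np.argmax(probs))   # index of the most likely outcome
```

Subtracting the maximum logit before exponentiating leaves the output unchanged mathematically but prevents overflow for large logits.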
We will use the categorical cross-entropy cost function, which is a natural choice for the kind of categorical predictions we will be making. The equation for the cross-entropy cost function is given as follows:
\[C = -\sum_i y'_i \log(y_i)\]
where \(y'_i\) is the one-hot target and \(y_i\) is the predicted probability for class \(i\).
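For a single training example this cost can be computed directly; the probability values below are made up for illustration:

```python
import numpy as np

# Categorical cross-entropy for one example: y_true is a one-hot target
# over (win, loss, draw); y_pred is a softmax output. Values illustrative.
def cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([1.0, 0.0, 0.0])                    # actual outcome: win
good = cross_entropy(y_true, np.array([0.9, 0.05, 0.05]))
bad = cross_entropy(y_true, np.array([0.1, 0.45, 0.45]))
```

A confident correct prediction incurs a small cost, while placing low probability on the true class is penalized heavily, which is exactly the gradient signal we want for training.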
The neural network will be trained using the ‘backpropagation’ algorithm, in which errors are propagated backwards through the network in order to update the weights. The algorithm was popularized by Rumelhart, Hinton and Williams in a 1986 paper in Nature and is the standard method for computing the gradients used in gradient descent optimization of neural networks.
First, the algorithm calculates the errors in the output layer using the equation \[\delta^L_j = \frac{\partial C}{\partial a^L_j}\sigma'(z_j^L)\], where the first term on the right is the derivative of the cost function with respect to the j-th output activation, and the second term is the derivative of the activation function evaluated at the weighted input \(z_j^L\).
Second, we calculate the error with respect to each layer using the equation \[\delta^l = ((w^{l+1})^T\delta^{l+1})\odot \sigma'(z^l)\], which we can think of as the ‘backwards’ propagation step, since the error at layer l depends on the error \(\delta^{l+1}\) already computed for the layer after it.
Third, we compute the derivative of the cost function with respect to the biases in the network using the equation \[\frac{\partial C}{\partial b^l_j} = \delta^l_j\], which is identical to the value of the error calculated in the previous step.
Lastly, we find the derivative of the cost function with respect to the weights in the network using the equation \[\frac{\partial C}{\partial w^l_{jk}} = a_k^{l-1}\delta^l_j\].
Together, these derivatives give the gradient we use to update the weights and biases of the network via gradient descent.
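The four steps above can be sketched numerically on a tiny network. For simplicity this sketch uses sigmoid activations and the quadratic cost \(C = \frac{1}{2}\|a^L - y\|^2\) (so \(\delta^L = (a^L - y) \odot \sigma'(z^L)\)); the network size, weights and inputs are made-up assumptions, but the shape of each step is the same for any cost/activation pair:

```python
import numpy as np

# Minimal numerical sketch of the four backpropagation equations on a
# tiny 2 -> 3 -> 2 network with sigmoid activations and quadratic cost.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # layer 1: 2 -> 3
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)   # layer 2: 3 -> 2
x, y = np.array([0.5, -0.2]), np.array([1.0, 0.0])

# Forward pass, keeping the weighted inputs z^l and activations a^l.
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Step 1: output error  delta^L = dC/da^L * sigma'(z^L).
delta2 = (a2 - y) * sigmoid_prime(z2)
# Step 2: propagate backwards  delta^l = (W^{l+1}.T @ delta^{l+1}) * sigma'(z^l).
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
# Step 3: bias gradients equal the layer errors.
grad_b2, grad_b1 = delta2, delta1
# Step 4: weight gradients  dC/dw^l_{jk} = a^{l-1}_k delta^l_j  (outer product).
grad_W2 = np.outer(delta2, a1)
grad_W1 = np.outer(delta1, x)

# Gradient descent update with learning rate 0.1.
W2 -= 0.1 * grad_W2; b2 -= 0.1 * grad_b2
W1 -= 0.1 * grad_W1; b1 -= 0.1 * grad_b1
```

Note how step 4 produces a gradient matrix with the same shape as the weight matrix it updates, since each entry pairs one error \(\delta^l_j\) with one incoming activation \(a^{l-1}_k\).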